{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TextAttack with Custom Dataset and Word Embedding. This tutorial will show you how to use textattack with any dataset and word embedding you may want to use\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/QData/TextAttack/blob/master/docs/2notebook/4_Custom_Datasets_Word_Embedding.ipynb)\n",
    "\n",
    "[![View Source on GitHub](https://img.shields.io/badge/github-view%20source-black.svg)](https://github.com/QData/TextAttack/blob/master/docs/2notebook/4_Custom_Datasets_Word_Embedding.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_WVki6Bvbjur"
   },
   "source": [
    "## **Importing the Model**\n",
    "\n",
    "We start by choosing a pretrained model we want to attack. In this example we will use the albert base v2 model from HuggingFace. This model was trained with data from imbd, a set of movie reviews with either positive or negative labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 585,
     "referenced_widgets": [
      "1905ff29aaa242a88dc93f3247065364",
      "917713cc9b1344c7a7801144f04252bc",
      "b65d55c5b9f445a6bfd585f6237d22ca",
      "38b56a89b2ae4a8ca93c03182db26983",
      "26082a081d1c49bd907043a925cf88df",
      "1c3edce071ad4a2a99bf3e34ea40242c",
      "f9c265a003444a03bde78e18ed3f5a7e",
      "3cb9eb594c8640ffbfd4a0b1139d571a",
      "7d29511ba83a4eaeb4a2e5cd89ca1990",
      "136f44f7b8fa433ebff6d0a534c0588b",
      "2658e486ee77468a99ab4edc7b5191d8",
      "39bfd8c439b847e4bdfeee6e66ae86f3",
      "7ca4ce3d902d42758eb1fc02b9b211d3",
      "222cacceca11402db10ff88a92a2d31d",
      "108d2b83dff244edbebf4f8909dce789",
      "c06317aaf0064cb9b6d86d032821a8e2",
      "c18ac12f8c6148b9aa2d69885351fbcb",
      "b11ad31ee69441df8f0447a4ae62ce75",
      "a7e846fdbda740a38644e28e11a67707",
      "b38d5158e5584461bfe0b2f8ed3b0dc2",
      "3bdef9b4157e41f3a01f25b07e8efa48",
      "69e19afa8e2c49fbab0e910a5929200f",
      "2627a092f0c041c0a5f67451b1bd8b2b",
      "1780cb5670714c0a9b7a94b92ffc1819",
      "1ac87e683d2e4951ac94e25e8fe88d69",
      "02daee23726349a69d4473814ede81c3",
      "1fac551ad9d840f38b540ea5c364af70",
      "1027e6f245924195a930aca8c3844f44",
      "5b863870023e4c438ed75d830c13c5ac",
      "9ec55c6e2c4e40daa284596372728213",
      "5e2d17ed769d496db38d053cc69a914c",
      "dedaafae3bcc47f59b7d9b025b31fd0c",
      "8c2f5cda0ae9472fa7ec2b864d0bdc0e",
      "2a35d22dd2604950bae55c7c51f4af2c",
      "4c23ca1540fd48b1ac90d9365c9c6427",
      "3e4881a27c36472ab4c24167da6817cf",
      "af32025d22534f9da9e769b02f5e6422",
      "7af34c47299f458789e03987026c3519",
      "ed0ab8c7456a42618d6cbf6fd496b7b3",
      "25fc5fdac77247f9b029ada61af630fd"
     ]
    },
    "id": "4ZEnCFoYv-y7",
    "outputId": "c6c57cb9-6d6e-4efd-988f-c794356d4719"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "09e503d73c1042dfbc48e0148cfc9699",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=727.0, style=ProgressStyle(description_…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "79a819e8b3614fe280209cbc93614ce3",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=46747112.0, style=ProgressStyle(descrip…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "8ac83c6df8b746c3af829996193292cf",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=760289.0, style=ProgressStyle(descripti…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "edb863a582ac4ee6a0f0ac064c335843",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=156.0, style=ProgressStyle(description_…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "ccaf4ac6d7e24cc5b5e320f128a11b68",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=25.0, style=ProgressStyle(description_w…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "import transformers\n",
    "from textattack.models.wrappers import HuggingFaceModelWrapper\n",
    "\n",
    "# https://huggingface.co/textattack\n",
    "model = transformers.AutoModelForSequenceClassification.from_pretrained(\"textattack/albert-base-v2-imdb\")\n",
    "tokenizer = transformers.AutoTokenizer.from_pretrained(\"textattack/albert-base-v2-imdb\")\n",
    "# We wrap the model so it can be used by textattack\n",
    "model_wrapper = HuggingFaceModelWrapper(model, tokenizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "D61VLa8FexyK"
   },
   "source": [
    "## **Creating A Custom Dataset**\n",
    "\n",
    "Textattack takes in dataset in the form of a list of tuples. The tuple can be in the form of (\"string\", label) or (\"string\", label, label). In this case we will use former one, since we want to create a custom movie review dataset with label 0 representing a positive review, and label 1 representing a negative review.\n",
    "\n",
    "For simplicity, I created a dataset consisting of 4 reviews, the 1st and 4th review have \"correct\" labels, while the 2nd and 3rd review have \"incorrect\" labels. We will see how this impacts perturbation later in this tutorial.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "id": "nk_MUu5Duf1V"
   },
   "outputs": [],
   "source": [
    "# dataset: An iterable of (text, ground_truth_output) pairs.\n",
    "#0 means the review is negative\n",
    "#1 means the review is positive\n",
    "custom_dataset = [\n",
    "    ('I hate this movie', 0), #A negative comment, with a negative label\n",
    "    ('I hate this movie', 1), #A negative comment, with a positive label\n",
    "    ('I love this movie', 0), #A positive comment, with a negative label\n",
    "    ('I love this movie', 1), #A positive comment, with a positive label\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ijVmi6PbiUYZ"
   },
   "source": [
    "## **Creating An Attack**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "-iEH_hf6iMEw",
    "outputId": "0c836c5b-ddd5-414d-f73d-da04067054d8"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "textattack: Unknown if model of class <class 'transformers.models.albert.modeling_albert.AlbertForSequenceClassification'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.\n"
     ]
    }
   ],
   "source": [
    "from textattack import Attack\n",
    "from textattack.search_methods import GreedySearch\n",
    "from textattack.constraints.pre_transformation import RepeatModification, StopwordModification\n",
    "from textattack.goal_functions import UntargetedClassification\n",
    "from textattack.transformations import WordSwapEmbedding\n",
    "from textattack.constraints.pre_transformation import RepeatModification\n",
    "from textattack.constraints.pre_transformation import StopwordModification\n",
    "\n",
    "# We'll use untargeted classification as the goal function.\n",
    "goal_function = UntargetedClassification(model_wrapper)\n",
    "# We'll to use our WordSwapEmbedding as the attack transformation.\n",
    "transformation = WordSwapEmbedding() \n",
    "# We'll constrain modification of already modified indices and stopwords\n",
    "constraints = [RepeatModification(),\n",
    "               StopwordModification()]\n",
    "# We'll use the Greedy search method\n",
    "search_method = GreedySearch()\n",
    "# Now, let's make the attack from the 4 components:\n",
    "attack = Attack(goal_function, constraints, transformation, search_method)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4hUA8ntnfJzH"
   },
   "source": [
    "## **Attack Results With Custom Dataset**\n",
    "\n",
    "As you can see, the attack fools the model by changing a few words in the 1st and 4th review.\n",
    "\n",
    "The attack skipped the 2nd and and 3rd review because since it they were labeled incorrectly, they managed to fool the model without any modifications."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "-ivoHEOXfIfN",
    "outputId": "9ec660b6-44fc-4354-9dd1-1641b6f4c986"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[91m0 (99%)\u001b[0m --> \u001b[92m1 (81%)\u001b[0m\n",
      "\n",
      "\u001b[91mI\u001b[0m \u001b[91mhate\u001b[0m this \u001b[91mmovie\u001b[0m\n",
      "\n",
      "\u001b[92mdid\u001b[0m \u001b[92mhateful\u001b[0m this \u001b[92mfootage\u001b[0m\n",
      "\u001b[91m0 (99%)\u001b[0m --> \u001b[37m[SKIPPED]\u001b[0m\n",
      "\n",
      "I hate this movie\n",
      "\u001b[92m1 (96%)\u001b[0m --> \u001b[37m[SKIPPED]\u001b[0m\n",
      "\n",
      "I love this movie\n",
      "\u001b[92m1 (96%)\u001b[0m --> \u001b[91m0 (99%)\u001b[0m\n",
      "\n",
      "I \u001b[92mlove\u001b[0m this movie\n",
      "\n",
      "I \u001b[91miove\u001b[0m this movie\n"
     ]
    }
   ],
   "source": [
    "for example, label in custom_dataset:\n",
    "    result = attack.attack(example, label)\n",
    "    print(result.__str__(color_method='ansi'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "foFZmk8vY5z0"
   },
   "source": [
    "## **Creating A Custom Word Embedding**\n",
    "\n",
    "In textattack, a pre-trained word embedding is necessary in transformation in order to find synonym replacements, and in constraints to check the semantic validity of the transformation. To use custom pre-trained word embeddings, you can either create a new class that inherits the AbstractWordEmbedding class, or use the WordEmbedding class which takes in 4 parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "id": "owj_jMHRxEF5"
   },
   "outputs": [],
   "source": [
    "from textattack.shared import WordEmbedding\n",
    "\n",
    "embedding_matrix = [[1.0], [2.0], [3.0], [4.0]] #2-D array of shape N x D where N represents size of vocab and D is the dimension of embedding vectors.\n",
    "word2index = {\"hate\":0, \"despise\":1, \"like\":2, \"love\":3} #dictionary that maps word to its index with in the embedding matrix.\n",
    "index2word = {0:\"hate\", 1: \"despise\", 2:\"like\", 3:\"love\"} #dictionary that maps index to its word.\n",
    "nn_matrix = [[0, 1, 2, 3], [1, 0, 2, 3], [2, 1, 3, 0], [3, 2, 1, 0]] #2-D integer array of shape N x K where N represents size of vocab and K is the top-K nearest neighbours.\n",
    "\n",
    "embedding = WordEmbedding(embedding_matrix, word2index, index2word, nn_matrix)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "s9ZEV_ykhmBn"
   },
   "source": [
    "## **Attack Results With Custom Dataset and Word Embedding**\n",
    "\n",
    "Now if we run the attack again with the custom word embedding, you will notice the modifications are limited to the vocab provided by our custom word embedding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "gZ98UZ6I5sIn",
    "outputId": "59a653cb-85cb-46b5-d81b-c1a05ebe8a3e"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[91m0 (99%)\u001b[0m --> \u001b[92m1 (98%)\u001b[0m\n",
      "\n",
      "I \u001b[91mhate\u001b[0m this movie\n",
      "\n",
      "I \u001b[92mlike\u001b[0m this movie\n",
      "\u001b[91m0 (99%)\u001b[0m --> \u001b[37m[SKIPPED]\u001b[0m\n",
      "\n",
      "I hate this movie\n",
      "\u001b[92m1 (96%)\u001b[0m --> \u001b[37m[SKIPPED]\u001b[0m\n",
      "\n",
      "I love this movie\n",
      "\u001b[92m1 (96%)\u001b[0m --> \u001b[91m0 (99%)\u001b[0m\n",
      "\n",
      "I \u001b[92mlove\u001b[0m this movie\n",
      "\n",
      "I \u001b[91mdespise\u001b[0m this movie\n"
     ]
    }
   ],
   "source": [
    "from textattack.attack_results import SuccessfulAttackResult\n",
    "\n",
    "transformation = WordSwapEmbedding(3, embedding) \n",
    "\n",
    "attack = Attack(goal_function, constraints, transformation, search_method)\n",
    "\n",
    "for example, label in custom_dataset:\n",
    "    result = attack.attack(example, label)\n",
    "    print(result.__str__(color_method='ansi'))"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "name": "Custom Data and Embedding with TextAttack.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
